Classifying customer reviews with spaCy v3.0
This post shows how to train models that can automatically identify the sentiment and product category of customer reviews. With the help of the recently released spaCy v3.0, text classification is an absolute breeze.

Text classification is a very common NLP task. Given enough training data, it's relatively easy to build a model that automatically classifies previously unseen texts in a way that follows the logic of the training data. In this post, I'll go through the steps for building such a model. Specifically, I'll leverage the power of the recently released spaCy v3.0 to train two classification models: one that identifies the sentiment of customer reviews in Chinese as positive or negative (i.e. binary classification), and another that predicts their product category from a list of five (i.e. multiclass classification). If you can't wait to see how spaCy v3.0 has made the training process an absolute breeze, feel free to jump to the section on training the textcat component with the CLI. If not, bear with me along this long journey.
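As a quick preview of where we're headed: spaCy v3's training CLI consumes binary `.spacy` files rather than raw CSVs, so the labeled reviews will eventually be serialized as a `DocBin` with the label scores stored in `doc.cats`. Here's a minimal sketch of that step, using two made-up reviews in place of the real dataset:

```python
import spacy
from spacy.tokens import DocBin

# Illustrative mini-examples only -- the real reviews come from the dataset below
samples = [("服務很好，下次再來", 1), ("品質太差，非常失望", 0)]

nlp = spacy.blank("zh")  # blank Chinese pipeline
doc_bin = DocBin()
for text, label in samples:
    doc = nlp.make_doc(text)
    # textcat expects a score for every label on every doc
    doc.cats = {"POSITIVE": float(label), "NEGATIVE": 1.0 - label}
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")  # the binary file the training CLI reads
```

The same pattern works for the multiclass case: just put one key per product category in `doc.cats`.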
I was hoping to build classification models that take traditional Chinese texts as input, but I couldn't find any publicly available dataset of customer reviews in traditional Chinese, so I had to make do with reviews in simplified Chinese. Let's first download the dataset using !wget.
!wget https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip
Then we unzip the downloaded file online_shopping_10_cats.zip with, surprisingly, !unzip.
!unzip online_shopping_10_cats.zip
The dataset has three columns: review for review texts, label for sentiment, and cat for product categories. Here's a random sample of five reviews.
import pandas as pd
file_path = '/content/online_shopping_10_cats.csv'
df = pd.read_csv(file_path)
df.sample(5)
There are 62,774 reviews in total.
df.shape
The label column has only two unique values, 1 for positive reviews and 0 for negative ones.
df.label.unique()
The cat column has ten unique values.
df.cat.unique()
Before moving on, let's save the raw dataset to Google Drive (this assumes your Drive is already mounted in the Colab session). The dest variable can be any GDrive path you like.
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/"
!cp {file_path} {dest}
Now let's do some data filtering. The groupby function from pandas is very useful, and here's how to get the counts of each of the unique values in the cat column.
df.groupby(by='cat').size()
To create a balanced dataset, I decided to keep only the categories with exactly 10,000 reviews each. That leaves five product categories: 平板 for tablets, 水果 for fruits, 洗发水 for shampoo, 衣服 for clothing, and finally 酒店 for hotels.
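Incidentally, the five names don't have to be typed by hand; the list of categories to keep can be derived from the counts themselves. Here's a sketch on a hypothetical toy frame, with the 10,000 threshold scaled down to fit the toy data:

```python
import pandas as pd

# Toy stand-in for the review dataset; the real threshold would be 10000
df = pd.DataFrame({'cat': ['平板', '平板', '水果', '水果', '书籍']})

counts = df.groupby('cat').size()
cat_list = counts[counts == 2].index.tolist()  # categories with exactly 2 rows
print(cat_list)  # → ['平板', '水果']
```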
There are many ways to filter data in pandas, and my favorite is to first create a filt variable that holds a boolean mask, i.e. a Series of True and False values indicating, in this particular case, whether the value in the cat column is in the cat_list variable of categories to be kept. Then we can simply filter the data with df[filt]. After filtering, the dataset is reduced to 50,000 reviews.
cat_list = ['平板', '水果', '洗发水', '衣服', '酒店']
filt = df['cat'].isin(cat_list)
df = df[filt]
df.shape
Now the dataset is balanced in terms of both the cat and label columns. There are 10,000 reviews for each product category.
df.groupby(by='cat').size()
And there are 25,000 reviews for each of the two sentiments.
df.groupby(by='label').size()
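Grouping by both columns at once shows the cross-tabulated counts in a single view. A minimal sketch on a hypothetical toy frame with the same three columns:

```python
import pandas as pd

df = pd.DataFrame({
    'cat':    ['平板', '平板', '水果', '水果'],
    'label':  [1, 0, 1, 0],
    'review': ['很好用', '太卡了', '很新鮮', '壞掉了'],
})

# Rows are categories, columns are sentiment labels, cells are review counts
balance = df.groupby(['cat', 'label']).size().unstack(fill_value=0)
print(balance)
```

In the real dataset, every cell of this table should read 5,000 if the data is perfectly balanced.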
Having made sure the filtered dataset is balanced, we can now reset the index, and save the dataset as online_shopping_5_cats_sim.csv.
df.reset_index(inplace=True, drop=True)
df.to_csv(dest+"online_shopping_5_cats_sim.csv", sep=",", index=False)
Let's load back the file we just saved to make sure the dataset is accessible for later use.
df = pd.read_csv(dest+"online_shopping_5_cats_sim.csv")
df.tail()
Next, I converted the reviews from simplified Chinese to traditional Chinese using the OpenCC library.
!pip install OpenCC
OpenCC has many conversion configurations. I specifically used s2twp, which converts simplified Chinese to traditional Chinese adapted to Taiwanese vocabulary. The adaptation is not perfect, but it's better than a mechanical simplified-to-traditional character conversion. Here's a random review in the two writing systems.
from opencc import OpenCC
cc = OpenCC('s2twp')
test = df.loc[49995, 'review']
print(test)
test_tra = cc.convert(test)
print(test_tra)
Having made sure the conversion is correct, we can now go ahead and convert all reviews.
df.loc[:, 'review'] = df['review'].apply(cc.convert)
Let's make the same change to the cat column.
df.loc[:, 'cat'] = df['cat'].apply(cc.convert)
And then we save the converted dataset as online_shopping_5_cats_tra.csv.
df.to_csv(dest+'online_shopping_5_cats_tra.csv', sep=",", index=False)
Let's load back the file we just saved to make sure it's accessible in the future.
df = pd.read_csv(dest+'online_shopping_5_cats_tra.csv')
df.tail()
Before building models, I would normally inspect the dataset. There are many ways to do so. I recently learned a trick on Colab that lets you filter a dataset interactively. All it takes is three lines of code.
%load_ext google.colab.data_table
from google.colab import data_table
data_table.DataTable(df, include_index=False, num_rows_per_page=10)
Alternatively, if you'd like to see some sample reviews from all the categories, the groupby function is quite handy. The trick here is to feed pd.DataFrame.sample to the apply function so that you can specify the number of reviews to inspect from each product category.
df.groupby('cat').apply(pd.DataFrame.sample, n=3)[['label', 'review']]
Finally, one of the most powerful ways of exploring a dataset is to use the facets-overview library. Let's first create a column for the length of review texts.
df['len'] = df['review'].apply(len)
df.tail()
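Before turning to facets, the new len column can already answer simple questions on its own, e.g. how review length varies by category. A sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'cat':    ['平板', '平板', '酒店'],
    'review': ['好', '不太好', '服務很好，下次再來'],
})
df['len'] = df['review'].apply(len)

# count, mean, std, min, quartiles, and max of review length per category
length_stats = df.groupby('cat')['len'].describe()
print(length_stats)
```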
Then we install the library.
!pip install facets-overview
In order to render an interactive visualization of the dataset, we first convert the DataFrame object df to the json format and then add it to an HTML template, as shown below. If you choose len for Binning | X-Axis, cat for Binning | Y-Axis, and finally review for Label By, you'll see all the reviews beautifully arranged in terms of text length along the X axis and product categories along the Y axis. They're also color-coded with respect to sentiment, blue for positive and red for negative. Clicking on a point of either color shows the values of that particular datapoint. Feel free to play around.
from IPython.display import display, HTML
jsonstr = df.to_json(orient='records')
HTML_TEMPLATE = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
<facets-dive id="elem" height="600"></facets-dive>
<script>
var data = {jsonstr};
document.querySelector("#elem").data = data;
</script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))